This documentation shows how to screen scrape the Hindawi APC table.

Requirements

First you will need to install Python. The easiest way to do so is to install Anaconda.


In [16]:
from IPython.display import YouTubeVideo, HTML, Math, Image
# how to install anaconda on mac
YouTubeVideo('6Dv1wNvTPbg')


Out[16]:

Then you can copy every cell here, or you can import this document into your Anaconda folder (this part is not shown in the video; I'll try to find another video)
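If BeautifulSoup, pandas, or the XML and Excel helpers are not already part of your Anaconda installation, they can be installed directly from a notebook cell. This is just a suggested command, assuming pip is available in your environment:

In [ ]:
#install the packages used in this notebook (assumes pip is available)
!pip install beautifulsoup4 pandas lxml openpyxl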


Import all the necessary modules


In [17]:
#These modules will help you retrieve and parse the content of the Hindawi table
from bs4 import BeautifulSoup
import urllib.request  #in Python 3, urlopen lives in urllib.request
import pandas as pd

Retrieving the Hindawi HTML source page


In [18]:
hindawi_apc_url = 'http://www.hindawi.com/apc/'

hindawi_html_page = urllib.request.urlopen(hindawi_apc_url)

#the 'xml' parser requires lxml; BeautifulSoup's built-in 'html.parser' is an alternative
soup = BeautifulSoup(hindawi_html_page, 'xml')
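As an aside, if urllib gives you trouble, the same page can also be fetched with the requests library. This is only a sketch of an alternative, assuming requests is installed; it is not used in the rest of this notebook:

In [ ]:
#alternative download using the requests library (assumes requests is installed)
import requests

hindawi_html_page = requests.get(hindawi_apc_url).text
soup = BeautifulSoup(hindawi_html_page, 'xml')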

In [19]:
HTML('<iframe src=http://www.hindawi.com/apc/ width=900 height=550></iframe>')


Out[19]:

This Hindawi page contains one HTML table that we need to parse. Please look at the W3Schools documentation about 'table' if you're not familiar with HTML.
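Before writing the full loop, you can take a quick look at the header row of that table with BeautifulSoup. This is just an optional check, not part of the original workflow:

In [ ]:
#optional: inspect the header row ('th' cells) of the table
header_row = soup.find('tr')
print([th.text.strip() for th in header_row.find_all('th')])
#should print: ['Journal Title', 'ISSN', 'APC']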


Selecting or targeting what we want in the Hindawi webpage


In [20]:
#Within the table, find all the 'tr' (table rows)
content = soup.find_all('tr')

In [21]:
#the first elements of content look like this:
'''
<tr class="subscription_table_head">
	 <th>Journal Title</th>
	 <th>ISSN</th>
	 <th class="last_th">APC</th>
 </tr>, 
 <tr class="subscription_table_plus">
	 <td>
	   <a href="/journals/aaa/">Abstract and Applied Analysis</a>
	 </td>
	 <td>1687-0409</td>
	 <td class="to_right">$800</td>
 </tr>
 ...
 '''

Because the first tr contains the table header (Journal Title, ISSN, APC), we will start retrieving content from the second tr.


In [22]:
table = []

#start with the second 'tr', skipping the header row
for value in content[1:]:
    #This will find all the 'td' cells within this 'tr'
    value = value.find_all('td')
    
    #index of VALUE: 0                              ,     1      ,   2
    #value ===> 'Abstract and Applied Analysis', '1687-0409', '$800'
    
    apc = value[2].text.strip()
    #Let's remove the '$' sign, if any, and convert the amount to an integer
    if "$" in apc:
        apc = int(apc.split('$')[1])
    #if the APC is 'Free', then let's write 0 instead of Free
    else:
        apc = 0
    table.append([value[0].text.strip(), value[1].text.strip(), apc])
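Before building the DataFrame, it can be reassuring to check how many rows were parsed and what the first entries look like. An optional check, not part of the original notebook:

In [ ]:
#optional sanity check on the parsed rows
print(len(table))
print(table[:3])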

In [23]:
hindawi_apc_table = pd.DataFrame(table, columns=['Journal Title','ISSN','APC'])

Let's display the first 10 rows


In [24]:
hindawi_apc_table.head(10)


Out[24]:
Journal Title ISSN APC
0 Abstract and Applied Analysis 1687-0409 800
1 Active and Passive Electronic Components 1563-5031 600
2 Advances in Acoustics and Vibration 1687-627X 600
3 Advances in Aerospace Engineering 2314-7520 600
4 Advances in Agriculture 2314-7539 600
5 Advances in Anatomy 2314-7547 600
6 Advances in Andrology 2314-8446 600
7 Advances in Anesthesiology 2314-7555 600
8 Advances in Artificial Intelligence 1687-7489 0
9 Advances in Artificial Neural Systems 1687-7608 600
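Once the data is in a DataFrame, pandas also makes it easy to summarise the APC column, for example with describe() or by counting the journals with no charge. These are just illustrative extra steps, not part of the original workflow:

In [ ]:
#summary statistics of the APC column and a count of journals with no APC
print(hindawi_apc_table['APC'].describe())
print((hindawi_apc_table['APC'] == 0).sum(), 'journals have an APC of 0')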

Export the table to an Excel file


In [25]:
# Export to Excel
hindawi_apc_table.to_excel('Hindawi_apc_table.xlsx', sheet_name = 'Hindawi_APC_Table', index = False)
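Note that to_excel requires an Excel writer such as openpyxl. If you prefer a plain-text format, the same table can also be written to a CSV file; this is an alternative, not part of the original notebook:

In [ ]:
# Export to CSV
hindawi_apc_table.to_csv('Hindawi_apc_table.csv', index=False)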

In summary, to scrape the table you will need these steps (a condensed sketch follows the list):

  1. Import all the necessary modules (BeautifulSoup and pandas)
  2. Retrieve and parse the HTML page: soup = BeautifulSoup(hindawi_html_page, 'xml')
  3. Target the table rows
  4. Loop through these rows to retrieve what you are looking for
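Putting it all together, a minimal end-to-end sketch of these four steps (the same logic as above, condensed into a single cell) could look like this:

In [ ]:
#minimal end-to-end sketch: download, parse, loop, export
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd

url = 'http://www.hindawi.com/apc/'
soup = BeautifulSoup(urllib.request.urlopen(url), 'xml')

rows = []
#skip the header row, then read the three 'td' cells of every row
for tr in soup.find_all('tr')[1:]:
    cells = tr.find_all('td')
    apc = cells[2].text.strip()
    apc = int(apc.split('$')[1]) if '$' in apc else 0
    rows.append([cells[0].text.strip(), cells[1].text.strip(), apc])

hindawi_apc_table = pd.DataFrame(rows, columns=['Journal Title', 'ISSN', 'APC'])
hindawi_apc_table.to_excel('Hindawi_apc_table.xlsx', sheet_name='Hindawi_APC_Table', index=False)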